library(RnaSeqTutorial)
samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
res <- mclapply(dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt",
full.names=TRUE),function(fil){
read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
},mc.cores=2)
names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt"))
addInfo <- c("__no_feature","__ambiguous",
"__too_low_aQual","__not_aligned",
"__alignment_not_unique")
sel <- match(addInfo,res[[1]][,1])
count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
colnames(count.table) <- names(res)
rownames(count.table) <- res[[1]][,1][-sel]
conditions <- names(res)
sex <- samples$sex[match(names(res),samples$sample)]
date <- factor(samples$date[match(names(res),samples$sample)])
pal=brewer.pal(8,"Dark2")
The QA revealed that we have a confounding factor. Luckily, enough the grouping of sex is balanced between sampling years and hence we will be able to block the year effect to investigate the sex effect.
yearSexDesign <- samples[grep("[2,3].*",samples$sample),c("sex","date")]
The primary reason to use DESeq over DESeq2 or edgeR or any other is that DESeq is very conservative. The second reason is that DESeq give less weight to outliers than DESeq2 does and based on the study by Pakull et al., which identify a candidate gene for sexual determination - the same we simultaneously, independantly originally identified in our analyses - it seems that our sample 226.1 may be mis-classified. However, the sex phenotyping has been done thoroughly, so it may also be that the sex determination is a more complex trait and that Potri.019G047300 is only one factor influencing it As introduced we want to look at the sex by blocking the year effect. We start by estimating the size factors and the dispersion
cdsFull <- newCountDataSet(count.table,yearSexDesign)
cdsFull = estimateSizeFactors( cdsFull )
cdsFull = estimateDispersions( cdsFull )
We can now check the dispersion estimated by DESeq, which is, obviously, conservative. I.e. most dispersion values lie below the actual dispersion fit by about 1 fold.
plotDispLSD(cdsFull)
Next we create both models (one considering the date only and on the date and sex)
fit1 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date + sex ))
## ...............................
fit0 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date))
## ...............................
For the rest of the analysis, we ignore the genes that did not converge in the previous step
sel <- !is.na(fit1$converged) & !is.na(fit0$converged) & fit1$converged & fit0$converged
fit1 <- fit1[sel,]
fit0 <- fit0[sel,]
We next calculate the GLM p-value and adjust them for multiple testing
pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )
We finally visualize the obtained results. A first insight shows that a number of genes have very high fold changes, which is due the fact that these genes have very few read in very few samples. For clarity, in the following plots, we filter those genes with a too high FC.
boxplot(rowSums(count.table[rownames(fit1[fit1$sexM > 20,]),]>0),
ylab="number of sample with expression > 0",
main="distribution of the # of samples with\nexpression > 0 for genes with extreme high log2FC"
)
boxplot(rowSums(count.table[rownames(fit1[fit1$sexM < -20,]),]>0),
ylab="number of sample with expression > 0",
main="distribution of the # of samples with\nexpression > 0 for genes with extreme low log2FC"
)
The vast majority of these genes are anyway not significant.
plot(density(padjGLM[fit1$sexM > 20]),
main="Adj. p-value for genes with extreme log2FC",
xlab="Adj. p-value")
plot(density(padjGLM[fit1$sexM > -20]),
main="Adj. p-value for genes with extreme log2FC",
xlab="Adj. p-value")
So we filter them out
sel <- fit1$sexM > -20 & fit1$sexM < 20
fit1 <- fit1[sel,]
padjGLM <- padjGLM[sel]
pvalsGLM <- pvalsGLM[sel]
A further look into the data reveals that one p-value is equal to 0. As this is inadequate for plotting log odds (-log of the p-value), we change these 0s to be 1 log10 fold change smaller than the otherwise smallest p-value (for plotting purposes only!)
padjGLM[padjGLM==0] <- min(padjGLM[padjGLM!=0])/10
pvalsGLM[pvalsGLM==0] <- min(pvalsGLM[pvalsGLM!=0])/10
As a volcano plot with the non-adjusted p-values The red dots circles are the 4 most significantly differentially expressed genes, and the 8 with the absolute highest log2 fold changes (4 positive and 4 negative)
heatscatter(fit1$sexM,-log10(pvalsGLM),
main="Male vs. Female Differential Expression",cor=FALSE,
xlab="Log2 Fold Change", ylab="- log(10) p-value")
legend("topleft",bty="n","1% p-value cutoff",lty=2,col="gray")
points(fit1[pvalsGLM<.01,"sexM"],
-log10(pvalsGLM[pvalsGLM<.01]),
col="lightblue",pch=19)
pos <- c(order(pvalsGLM)[1:4],
order(fit1$sexM,decreasing=TRUE)[1:4],
order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(pvalsGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")
As a volcano plot with adjusted p-values
heatscatter(fit1$sexM,-log10(padjGLM),
main="Male vs. Female Differential Expression",cor=FALSE,
xlab="Log2 Fold Change", ylab="- log(10) adj. p-value")
legend("topleft",bty="n","1% FDR cutoff",lty=2,col="gray")
points(fit1[padjGLM<.01,"sexM"],-log10(padjGLM[padjGLM<.01]),col="lightblue",pch=19)
pos <- c(order(padjGLM)[1:4],order(fit1$sexM,decreasing=TRUE)[1:4],order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(padjGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")
As DESeq is an “older” implementation of this type of analysis, we will redo the analysis using DESeq2. To ultimately be able to compare the obtained set of differential expression candidate genes, we save the list of genes significant at a 1% adjusted p-value cutoff.
UmAspD <- rownames(fit1[padjGLM<.01,])
which are not so many (one)
UmAspD
## [1] "Potra000503g03273.0"
We redo the analyses performed above. As you will see, it has been made more intuitive in DESeq2 and the same can be achieved in fewer commands.
dds <- DESeqDataSetFromMatrix(
countData = count.table,
colData = data.frame(condition=conditions,
sex=sex,
date=date),
design = ~date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)
res <- results(dds,contrast=c("sex","M","F"))
In 4 commands we have reproduced the analysis and we have observed that the dispersion estimation looked different. DESeq2 introduces a so called “shrinkage”, which you can learn more about in the DESeq2 vignettes. We, next, extract the results using the same 1% cutoff and plot similar validation plots
alpha=0.01
plotMA(res,alpha=alpha)
volcanoPlot(res,alpha=alpha)
## NULL
The volcano plot look similar to what we observed previously (except that we have by mistake inverted the ratio calculation, we now did F-M), although slightly less dead that in DESeq. We also do not get any genes with a p-value of 0
hist(res$padj,breaks=seq(0,1,.01))
sum(res$padj<alpha,na.rm=TRUE)
## [1] 3
4 genes are candidates for differential expression at a 1% adjusted p-value cutoff, which we also save for later comparison
UmAspD2 <- rownames(res[order(res$padj,na.last=TRUE),][1:sum(res$padj<alpha,na.rm=TRUE),])
UmAspD2
## [1] "Potra003910g23469.0" "Potra001414g11999.0" "Potra004533g25038.0"
We can already observe that one gene is common with the previous list, but that the chromosome 19 candidate has disappeared. This is surprising, given the Pakull et al. publication, that Potri.019G047300.0 (the homolog of Potra000503g03273) does not come out anymore as a differentially expressed candidate gene at a 1% FDR cutoff with DESeq2. But it’s adjusted p-value is 2.1% - see below. Moreover it’s cook distance - a measure of the homogeneity of the gene expression within a condition - is relatively high (2x the average) and might indicate the presence of an outlier. Looking into more details at the count values and sex assignment shows that the sample 226.1 behaves like a male sample with regards to that gene, so either the sample has been mis-sexed or the Pakull et al. results in P. tremuloides and P.tremula are only a partial view of the true biology.
as.data.frame(res)["Potra000503g03273.0",]
## baseMean log2FoldChange lfcSE stat pvalue padj
## Potra000503g03273.0 125 1.1 0.24 4.6 3.9e-06 0.02
data.frame(
counts=counts(dds,normalized=TRUE)["Potra000503g03273.0",],
sex,
date,
conditions)
## counts sex date conditions
## 202 7.64 F 2008 202
## 207 145.27 M 2010 207
## 213.1 0.00 F 2008 213.1
## 221 261.99 M 2008 221
## 226.1 23.02 F 2010 226.1
## 229.1 305.21 M 2008 229.1
## 229 79.92 M 2010 229
## 235 91.24 M 2010 235
## 236 0.92 F 2010 236
## 239 0.00 F 2010 239
## 244 2.22 F 2010 244
## 303 476.78 M 2008 303
## 305.3 87.38 F 2008 305.3
## 309.1 237.93 M 2008 309.1
## 310.3 49.12 F 2008 310.3
## 337.1 349.03 M 2008 337.1
## 349.2 5.15 F 2008 349.2
mcols(dds)[names(rowRanges(dds))=="Potra000503g03273.0",]
## DataFrame with 1 row and 36 columns
## baseMean baseVar allZero dispGeneEst dispFit dispersion dispIter
## <numeric> <numeric> <logical> <numeric> <numeric> <numeric> <numeric>
## 1 125 21809 FALSE 1.3 0.13 0.99 10
## dispOutlier dispMAP Intercept date2008 date2010 sexF sexM
## <logical> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 FALSE 0.99 6.2 0.64 -0.64 -0.56 0.56
## SE_Intercept SE_date2008 SE_date2010 SE_sexF SE_sexM MLE_Intercept
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.35 0.25 0.25 0.12 0.12 4.8
## MLE_date_2010_vs_2008 MLE_sex_M_vs_F WaldStatistic_Intercept
## <numeric> <numeric> <numeric>
## 1 -1.9 3.7 18
## WaldStatistic_date2008 WaldStatistic_date2010 WaldStatistic_sexF
## <numeric> <numeric> <numeric>
## 1 2.6 -2.6 -4.6
## WaldStatistic_sexM WaldPvalue_Intercept WaldPvalue_date2008
## <numeric> <numeric> <numeric>
## 1 4.6 4.7e-69 0.0099
## WaldPvalue_date2010 WaldPvalue_sexF WaldPvalue_sexM betaConv betaIter
## <numeric> <numeric> <numeric> <logical> <numeric>
## 1 0.0099 3.9e-06 3.9e-06 TRUE 18
## deviance maxCooks
## <numeric> <numeric>
## 1 183 0.77
sex[conditions=="226.1"] <- "M"
dds <- DESeqDataSetFromMatrix(
countData = count.table,
colData = data.frame(condition=conditions,
sex=sex,
date=date),
design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res.swap <- results(dds,contrast=c("sex","M","F"))
plotMA(res.swap,alpha=alpha)
volcanoPlot(res.swap,alpha=alpha)
## NULL
hist(res.swap$padj,breaks=seq(0,1,.01))
sum(res.swap$padj<alpha,na.rm=TRUE)
## [1] 7
UmAspD2.swap <- rownames(res.swap[order(res.swap$padj,na.last=TRUE),][1:sum(res.swap$padj<alpha,na.rm=TRUE),])
UmAspD2.swap
## [1] "Potra000503g03273.0" "Potra001218g10512.0" "Potra001519g12651.0"
## [4] "Potra003910g23469.0" "Potra005027g34078.0" "Potra181824g28189.0"
## [7] "Potra002176g16783.0"
Fair enough, the gene made it back in the list.
sel <- conditions != "226.1"
dds <- DESeqDataSetFromMatrix(
countData = count.table[,sel],
colData = data.frame(condition=conditions[sel],
sex=sex[sel],
date=date[sel]),
design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)
res.sel <- results(dds,contrast=c("sex","M","F"))
plotMA(res.sel,alpha=alpha)
volcanoPlot(res.sel,alpha=alpha)
## NULL
hist(res.sel$padj,breaks=seq(0,1,.01))
sum(res.sel$padj<alpha,na.rm=TRUE)
## [1] 8
UmAspD2.sel <- rownames(res.sel[order(res.sel$padj,na.last=TRUE),][1:sum(res.sel$padj<alpha,na.rm=TRUE),])
UmAspD2.sel
## [1] "Potra003910g23469.0" "Potra005027g34078.0" "Potra000503g03273.0"
## [4] "Potra001414g11999.0" "Potra001519g12651.0" "Potra004533g25038.0"
## [7] "Potra001218g10512.0" "Potra178727g27972.0"
We compare the results from the 4 approaches, the easiest way being to plot a Vann Diagram
plot.new()
grid.draw(venn.diagram(list(
UmAsp=UmAspD2,
UmAsp.swap=UmAspD2.swap,
UmAsp.sel=UmAspD2.sel,
UmAspD = UmAspD),
filename=NULL,
col=pal[1:4],
category.names=c("UmAsp - DESeq2",
"UmAsp - corrected",
"UmAsp - removed",
"UmAsp - DESeq")))
Unlike in the original analysis conducted in the study, where the Populus trichocarpa genome was used as a reference (the P. tremula genome assembly is still a draft assembly), where the gene Potri.014G155300.0 was constantly found by all 4 approaches and the gene Potri.019G047300.0 (Potra000503g03273.0 homolog) was found by DESeq and DESeq2, there is no overlap between both native methods. This is probably because the result in P. tremula are obscured by an assembly artefact. Indeed we know that the two genes in P. trichocarpa have a very high sequence similarity (being putatively very recent paralogs), which appears to have been collapsed into a single locus in P. tremula. We can also note that “messing” with the sample sex attribute or removing that sample leads to the identification of several more genes! It is unclear if these are the results of a technical error or if they would be worth investigating further.
sort(intersect(UmAspD2,UmAspD))
## character(0)
sort(intersect(UmAspD2.swap,UmAspD))
## [1] "Potra000503g03273.0"
sort(intersect(UmAspD2.sel,UmAspD))
## [1] "Potra000503g03273.0"
Analyses performed when asking Mike Love (DESeq2 developer) why DESeq2 seem so sensitive to misclassification The short answer was to use the Cook distance to further investigate that and that the 1% FDR cutoff was conservative. These are commented out on purpose. Note that the rowData function has been replaced by rowRanges in Bioconductor version 3.1 and above.
## mis-classified
# sex <- samples$sex[match(colnames(count.table),samples$sample)]
# date <- factor(samples$date[match(colnames(count.table),samples$sample)])
# ddsM <- DESeqDataSetFromMatrix(
# countData = count.table,
# colData = data.frame(condition=conditions,
# sex=sex,
# date=date),
# design = ~ date+sex)
# ddsM <- DESeq(ddsM)
# resM <- results(ddsM,contrast=c("sex","F","M"))
#
# counts(ddsM,normalized=TRUE)["Potri.019G047300.0",]
# resM["Potri.019G047300.0",]
# mcols(ddsM)[names(rowData(ddsM)) == "Potri.019G047300.0",]
#
# ## reclassified
# sexR <- sex
# sexR[5] <- "M"
# ddsR <- DESeqDataSetFromMatrix(
# countData = count.table,
# colData = data.frame(condition=conditions,
# sex=sexR,
# date=date),
# design = ~ date+sex)
# ddsR <- DESeq(ddsR)
# resR <- results(ddsR,contrast=c("sex","F","M"))
#
# counts(ddsR,normalized=TRUE)["Potri.019G047300.0",]
# resR["Potri.019G047300.0",]
# mcols(ddsR)[names(rowData(ddsR)) == "Potri.019G047300.0",]
#
# ## samples details
# sex
# date
#
# ## the model is not be affected by the reclassification
# lapply(split(sex,date),table)
# lapply(split(sexR,date),table)
#